The United States is the largest economy in the world. Based on realtrends article, real estate accounts for 16.9% of US's GDP on 2021 which is considered a major GDP contributor. The following analysis aims to provide viewers with real estate insights that may facilitate general understanding and potential investment decisions through market trends and factors influencing the real estate market.
# Importing Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
from IPython.display import HTML
HTML('''
<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()">
<input type="submit" value="Toggle Code">
</form>
''')
# Importing data
raw_real_estate = pd.read_csv("realtor_data.csv")
This stage contains initial exploration of the real estate dataset. It involves identifying missing values, reformating certain columns for analysis purpose, filtering out duplicated datas and keep thedata required for analysis purpose.
# Initial Exploration
# Identifying the shape of the dataset
shape = raw_real_estate.shape
print(shape)
# identifying the data types of all the columns
data_types = raw_real_estate.dtypes
(923159, 12)
# Identifying the number of missing values in each the columns
missing_values = raw_real_estate.isna().sum().reset_index()
missing_values.columns = ['Column_Names', 'Missing_value_Count']
display(missing_values)
| Column_Names | Missing_value_Count | |
|---|---|---|
| 0 | status | 0 |
| 1 | price | 71 |
| 2 | bed | 131703 |
| 3 | bath | 115192 |
| 4 | acre_lot | 273623 |
| 5 | full_address | 0 |
| 6 | street | 2138 |
| 7 | city | 74 |
| 8 | state | 0 |
| 9 | zip_code | 205 |
| 10 | house_size | 297843 |
| 11 | sold_date | 466763 |
# Changing the date column into "date" data type
raw_real_estate['sold_date'] = pd.to_datetime(raw_real_estate['sold_date'], format = '%Y-%m-%d')
# Removing rows with missing values in certain columns
raw_real_estate = raw_real_estate.dropna(subset = ['price', 'bed', 'bath', 'acre_lot', 'zip_code', 'house_size', 'sold_date'])
## Filtering for datas for years between year 2000 and 2022 (ordered)
# Finding the max and min dates in the dataset
raw_real_estate['sold_date'].agg(['min', 'max'])
# Filtering the date from the range of dates in order
raw_real_estate = raw_real_estate[(raw_real_estate['sold_date'] >= '2000-01-01')
& (raw_real_estate['sold_date'] <= '2022-12-31')].sort_values("sold_date")
# Filtering the columns needed for further analysis
final_real_estate = raw_real_estate.loc[:, (raw_real_estate.columns != 'street') ]
# Removing Duplicate Values
final_real_estate = final_real_estate.drop_duplicates(subset = ["full_address", "sold_date"])
# Reformating the price column
final_real_estate['price'] = final_real_estate['price']/1000
# Create the column called "year"
final_real_estate['year'] = pd.DatetimeIndex(final_real_estate['sold_date']).year
# Renaming the price column
final_real_estate.rename(columns = {"price": "Sale_Price_Thousands"}, inplace = True)
# Adding a new column called Price per Squre Foot (Value for Money)
final_real_estate['price_per_sqft'] = (final_real_estate['Sale_Price_Thousands']/final_real_estate['house_size']) * 1000
# Adding a new column called room ratio (feeling of balance of the house)
final_real_estate['room_ratio'] = final_real_estate['bed']/final_real_estate['bath']
# Downloading the new code
final_real_estate.to_csv("final_real_estate.csv", index = False)
display(final_real_estate.head(10))
| status | Sale_Price_Thousands | bed | bath | acre_lot | full_address | city | state | zip_code | house_size | sold_date | year | price_per_sqft | room_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 920877 | for_sale | 225.0 | 3.0 | 2.0 | 1.18 | 21 Morris Dr, Newburgh, NY, 12550 | Newburgh | New York | 12550.0 | 1245.0 | 2000-01-03 | 2000 | 180.722892 | 1.500000 |
| 634864 | for_sale | 451.0 | 4.0 | 3.0 | 0.29 | 16 Underwood Ct, Burlington, NJ, 08016 | Burlington | New Jersey | 8016.0 | 2085.0 | 2000-01-04 | 2000 | 216.306954 | 1.333333 |
| 627427 | for_sale | 289.0 | 1.0 | 1.0 | 0.13 | 113 Cleveland Ln, Rockaway, NJ, 07866 | Rockaway | New Jersey | 7866.0 | 915.0 | 2000-01-04 | 2000 | 315.846995 | 1.000000 |
| 229776 | for_sale | 350.0 | 3.0 | 2.0 | 0.15 | 31 Benefit St, Pawtucket, RI, 02861 | Pawtucket | Rhode Island | 2861.0 | 1500.0 | 2000-01-04 | 2000 | 233.333333 | 1.500000 |
| 448440 | for_sale | 365.0 | 2.0 | 1.0 | 0.95 | 49 Tinker Hill Rd, Washington, CT, 06777 | Washington | Connecticut | 6777.0 | 720.0 | 2000-01-04 | 2000 | 506.944444 | 2.000000 |
| 590524 | for_sale | 949.9 | 5.0 | 5.0 | 5.04 | 5 Country Ln, Tewksbury Township, NJ, 07830 | Tewksbury Township | New Jersey | 7830.0 | 4508.0 | 2000-01-04 | 2000 | 210.714286 | 1.000000 |
| 778538 | for_sale | 695.0 | 6.0 | 2.0 | 0.07 | 1324 Star Ave, Elmont, NY, 11003 | Elmont | New York | 11003.0 | 1770.0 | 2000-01-04 | 2000 | 392.655367 | 3.000000 |
| 612420 | for_sale | 85.0 | 3.0 | 1.0 | 0.02 | 1126 Chestnut St, Wilmington, DE, 19805 | Wilmington | Delaware | 19805.0 | 1075.0 | 2000-01-04 | 2000 | 79.069767 | 3.000000 |
| 449429 | for_sale | 349.0 | 3.0 | 2.0 | 0.18 | 11 Dartmouth Ln, Danbury, CT, 06810 | Danbury | Connecticut | 6810.0 | 1170.0 | 2000-01-05 | 2000 | 298.290598 | 1.500000 |
| 686596 | for_sale | 154.9 | 4.0 | 1.0 | 0.22 | 216 South Ave, Bridgeton, NJ, 08302 | Bridgeton | New Jersey | 8302.0 | 1008.0 | 2000-01-05 | 2000 | 153.670635 | 4.000000 |
A usual business cycle can be identified in the time series plot below where the peaks occured at 2005 and 2017 and the downturn at 2000, 2011 and 2022. The downturn occured at 2011 maybe considered weak recoveries from the 2008 global financial crisis and 2022 downturn maybe caused by the covid pandemic and the rise of interest rates.
# Finding the sum of real estate sales Volume in each of the year
General_Sales_Volume = final_real_estate['year'].value_counts().sort_index()
# Plotting the Real Estate Sales Volume over the years
Sales_Volume_Plot = General_Sales_Volume.plot(title = "Total Real Estate Sales Volume from 2000 to 2022",
xlabel = "year", ylabel = "Sales_Volume", kind = "line")
The real estate sales value time series plot below shows a similar pattern as the sales volume time series plot. This may be due to the large value each real estate carries which makes changes in real estate sales very obvious.
# Finding the sum of real estate sales Value in each of the year
General_Sales_Values = pd.DataFrame(final_real_estate.groupby('year')['Sale_Price_Thousands'].sum())
Sales_Revenue_Plot = General_Sales_Values.plot(title = "Total Real Estate Sales Values from 2000 to 2022",
ylabel = "Sales_Price_Thousands", kind = "line", legend = False)
Currently Virgin islands is a state with the highest median property price however, there were only datas on 2 properties listed on Virgin Island and one of the property price was heavily skewed thus creating an outlier. Therefore, its not a good indicator of a median prices in Virgin Islands and thus will be removed from the chart for more accurate analysis.
# find the median house prices in each state
median_prices = final_real_estate.groupby("state")['Sale_Price_Thousands'].median().reset_index().sort_values(by='Sale_Price_Thousands', ascending=False)
median_prices_except_highest = median_prices.iloc[1:]
plt.figure(figsize=(12, 6))
sns.barplot(data=median_prices_except_highest, x='state', y='Sale_Price_Thousands', palette='viridis')
plt.xticks(rotation=90)
plt.xlabel("State")
plt.ylabel("Median Property Price (Thousands $)")
plt.title("Median Property Price by State")
plt.show()
# identifying all the properties from Virgin Islands
virgin_islands_rows = final_real_estate[final_real_estate["state"] == "Virgin Islands"]
display(virgin_islands_rows)
| status | Sale_Price_Thousands | bed | bath | acre_lot | full_address | city | state | zip_code | house_size | sold_date | year | price_per_sqft | room_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11371 | for_sale | 950.0 | 5.0 | 4.0 | 0.99 | 46 Water Isle, Saint Thomas, VI, 00802 | Saint Thomas | Virgin Islands | 802.0 | 5000.0 | 2013-10-11 | 2013 | 190.000000 | 1.250000 |
| 10351 | for_sale | 6899.0 | 4.0 | 6.0 | 0.83 | 10 Water Isle, Saint Thomas, VI, 00802 | Saint Thomas | Virgin Islands | 802.0 | 4600.0 | 2018-04-05 | 2018 | 1499.782609 | 0.666667 |
New Jersey has the highest Real Estate sales volume across 22 years while New York is slightly ahead of Pennsylvania.
#2. Identifying the sales volume based on states
# Identifying the number of states available in the US
final_real_estate['state'].nunique()
# Identifying the top 5 states in terms of real estate sales volume
final_real_estate['state'].value_counts().sort_values().tail(5).plot(kind = 'barh',
color = ['Green', 'Blue', 'Orange', 'Brown', 'Purple'])
plt.ylabel("States")
plt.xlabel("Real Estate Sales Volume")
plt.title("Top 5 US States Real Estate Sales Volume")
Text(0.5, 1.0, 'Top 5 US States Real Estate Sales Volume')
New Jersey has the highest Real Estate sales volume each year for 22 years while the New York state has fallen behind Pennsylvania in recent years. Furthermore, Massachusetts have the least variation and lowest real estate sales for the past 12 years. Overall the real estate sales volume in the top 5 states follows a similar trend over the past 22 years.
# Identifying the real estate sales volume in the top 5 states over the years
# Filter out datas that only contains the top 5 states in terms of volume
top_5_states = final_real_estate[final_real_estate.state.isin(["New Jersey", "Connecticut", "New York", "Pennsylvania", "Massachusetts"])]
top_5_states_sales_volume = pd.DataFrame(top_5_states.groupby(['year', 'state']).size()).rename(columns = {0: "State_Count"})
# Plotting the
Palette = ["Brown", "Green", "Purple", "Orange", "Blue"]
sns.set_palette(Palette)
top_5_states_sales_volume = sns.relplot(x = "year", y = "State_Count",
data = top_5_states_sales_volume, kind = "line", hue = "state")
top_5_states_sales_volume.fig.suptitle("Top 5 States Real Estate Sales Volume from year 2000 to 2022", y = 1)
Text(0.5, 1, 'Top 5 States Real Estate Sales Volume from year 2000 to 2022')
The top 5 US states for sales volume and sales value are the same, however, the New York state has taken the first position having the most real estate sales value for the past 22 years.
#2b Identifying the top 5 states based Sale_Price_Thousandson real estate sales value
final_real_estate.groupby('state')['Sale_Price_Thousands'].sum().sort_values().tail(5).plot(kind = "barh",
color = ['Blue', 'Green', 'Brown', 'Purple', "Orange"])
plt.xlabel("Real Estate Sales Value")
plt.ylabel("States")
plt.title("Top 5 US States Real Estate Sales Value")
Text(0.5, 1.0, 'Top 5 US States Real Estate Sales Value')
From the Prices per Sqft analysis, it shows only New York and Massachusetts have median price per sqft that is above 300 dollars while other states have quite similar median price per sqft. This would provide some insights of value for money and the prices of properties in each states, however it is important to note that this metric does not consider qualitative factors such as property layout and design and the benefits specific locations bring.
filtered_data = final_real_estate[~final_real_estate['state'].isin(['Puerto Rico', 'Virgin Islands'])]
state_pricespersqft = filtered_data.groupby('state')['price_per_sqft'].median().sort_values()
# generate unique colours for each state using seaborn
colors = sns.color_palette('tab20', len(state_pricespersqft))
state_pricespersqft.plot(kind = 'barh', color = colors)
plt.xlabel("Prices per Sqft")
plt.ylabel("States")
plt.title("Prices per Sqft Across the US States")
Text(0.5, 1.0, 'Prices per Sqft Across the US States')
state_counts = final_real_estate['state'].value_counts()
display(state_counts)
New Jersey 7348 Connecticut 5113 New York 4240 Pennsylvania 3999 Massachusetts 1920 Rhode Island 1152 New Hampshire 682 Delaware 602 Vermont 432 Maine 363 Virgin Islands 2 Puerto Rico 1 Name: state, dtype: int64
#3 Identify the Information of the Most expensive real estate in each year.
Most_Expensive_Real_Estate = top_5_states.loc[top_5_states.groupby('year')['Sale_Price_Thousands'].idxmax()]
display(Most_Expensive_Real_Estate)
| status | Sale_Price_Thousands | bed | bath | acre_lot | full_address | city | state | zip_code | house_size | sold_date | year | price_per_sqft | room_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 922453 | for_sale | 39000.0 | 8.0 | 13.0 | 8.23 | 555 Lake Ave, Greenwich, CT, 06830 | Greenwich | Connecticut | 6830.0 | 18954.0 | 2000-07-17 | 2000 | 2057.613169 | 0.615385 |
| 649077 | for_sale | 13950.0 | 10.0 | 14.0 | 16.93 | 770 Godfrey Rd, Villanova, PA, 19085 | Villanova | Pennsylvania | 19085.0 | 19000.0 | 2001-03-07 | 2001 | 734.210526 | 0.714286 |
| 560341 | for_sale | 32500.0 | 8.0 | 8.0 | 0.05 | 20 E 73rd St, New York, NY, 10021 | New York | New York | 10021.0 | 11371.0 | 2002-03-21 | 2002 | 2858.147920 | 1.000000 |
| 409975 | for_sale | 60000.0 | 9.0 | 10.0 | 400.00 | 450 Brickyard Rd, Woodstock, CT, 06281 | Woodstock | Connecticut | 6281.0 | 18777.0 | 2003-08-28 | 2003 | 3195.398626 | 0.900000 |
| 495469 | for_sale | 25000.0 | 6.0 | 11.0 | 0.07 | 70 Broad St, New York City, NY, 10004 | New York City | New York | 10004.0 | 19478.0 | 2004-07-12 | 2004 | 1283.499333 | 0.545455 |
| 204657 | for_sale | 15000.0 | 6.0 | 6.0 | 4.50 | 2 Squibnocket Rd, Chilmark, MA, 02535 | Chilmark | Massachusetts | 2535.0 | 4706.0 | 2005-10-04 | 2005 | 3187.420314 | 1.000000 |
| 734864 | for_sale | 18500.0 | 3.0 | 5.0 | 1.94 | 151 E 58th St Ph 55W, New York City, NY, 10022 | New York City | New York | 10022.0 | 4500.0 | 2006-07-11 | 2006 | 4111.111111 | 0.600000 |
| 743986 | for_sale | 20500.0 | 4.0 | 5.0 | 1.33 | New York City, NY, 10023 | New York City | New York | 10023.0 | 3333.0 | 2007-12-20 | 2007 | 6150.615062 | 0.800000 |
| 743290 | for_sale | 50000.0 | 6.0 | 8.0 | 0.22 | 995 5th Ave Unit Ph, New York City, NY, 10028 | New York City | New York | 10028.0 | 6891.0 | 2008-09-12 | 2008 | 7255.840952 | 0.750000 |
| 64850 | for_sale | 15000.0 | 8.0 | 10.0 | 157.00 | 47 White Bridge Rd, Chatham, NY, 12136 | Chatham | New York | 12136.0 | 14058.0 | 2009-10-20 | 2009 | 1067.008109 | 0.800000 |
| 917730 | for_sale | 10600.0 | 8.0 | 11.0 | 4.02 | 488 West Rd, New Canaan, CT, 06840 | New Canaan | Connecticut | 6840.0 | 11923.0 | 2010-07-12 | 2010 | 889.037994 | 0.727273 |
| 743968 | for_sale | 17950.0 | 7.0 | 8.0 | 0.23 | 535 W End Ave Unit 12TH, New York City, NY, 10024 | New York City | New York | 10024.0 | 8451.0 | 2011-01-05 | 2011 | 2124.008993 | 0.875000 |
| 917641 | for_sale | 16000.0 | 7.0 | 9.0 | 47.16 | 187 Umpawaug Rd, Redding, CT, 06896 | Redding | Connecticut | 6896.0 | 7590.0 | 2012-11-28 | 2012 | 2108.036891 | 0.777778 |
| 743414 | for_sale | 45000.0 | 4.0 | 10.0 | 0.05 | 12 E 82nd St, New York City, NY, 10028 | New York City | New York | 10028.0 | 12000.0 | 2013-02-05 | 2013 | 3750.000000 | 0.400000 |
| 463153 | for_sale | 32000.0 | 6.0 | 12.0 | 10.00 | 16 Hurlingham Dr, Greenwich, CT, 06831 | Greenwich | Connecticut | 6831.0 | 23700.0 | 2014-02-21 | 2014 | 1350.210970 | 0.500000 |
| 898460 | for_sale | 19000.0 | 4.0 | 5.0 | 0.05 | 18 W 11th St, New York City, NY, 10011 | New York City | New York | 10011.0 | 6000.0 | 2015-06-09 | 2015 | 3166.666667 | 0.800000 |
| 781352 | for_sale | 18250.0 | 5.0 | 6.0 | 0.05 | 116 E 65th St, New York City, NY, 10065 | New York City | New York | 10065.0 | 6950.0 | 2016-04-28 | 2016 | 2625.899281 | 0.833333 |
| 742416 | for_sale | 16000.0 | 7.0 | 12.0 | 0.05 | 224 W 22nd St, New York City, NY, 10011 | New York City | New York | 10011.0 | 7640.0 | 2017-01-20 | 2017 | 2094.240838 | 0.583333 |
| 751525 | for_sale | 12995.0 | 4.0 | 4.0 | 0.02 | 40 Leroy St, New York City, NY, 10014 | New York City | New York | 10014.0 | 3650.0 | 2018-07-27 | 2018 | 3560.273973 | 1.000000 |
| 640427 | for_sale | 29900.0 | 5.0 | 3.0 | 209.50 | 944 Providence Rd, Newtown Square, PA, 19073 | Newtown Square | Pennsylvania | 19073.0 | 2900.0 | 2019-02-07 | 2019 | 10310.344828 | 1.666667 |
| 461367 | for_sale | 13650.0 | 7.0 | 10.0 | 2.53 | 32 Grahampton Ln, Greenwich, CT, 06830 | Greenwich | Connecticut | 6830.0 | 10534.0 | 2020-06-26 | 2020 | 1295.804063 | 0.700000 |
| 656439 | for_sale | 12500.0 | 5.0 | 6.0 | 0.66 | 5499 Dune Dr, Avalon, NJ, 08202 | Avalon | New Jersey | 8202.0 | 4462.0 | 2021-01-11 | 2021 | 2801.434334 | 0.833333 |
| 678035 | for_sale | 11500.0 | 9.0 | 13.0 | 0.85 | 4816 5th Ave, Avalon, NJ, 08202 | Avalon | New Jersey | 8202.0 | 10000.0 | 2022-12-31 | 2022 | 1150.000000 | 0.692308 |
Out of 22 years, the New York state sold the most expensive real estate in the US for 12 years, outbeating other states by at least 50%. Moreover, this may mean that real estates in New York state tends to appreciate more and/or has higher value due to higher living standards.
#3b At which states are the most expensive real estates sold located in?
plot = Most_Expensive_Real_Estate['state'].value_counts().plot(kind = "bar",
color = ['Orange', 'Brown', 'Blue', 'Purple', "Green"])
plt.xticks(rotation = 0)
plt.xlabel("State")
plt.ylabel("Most Expensive Real Estate Count")
plt.title("Most Expensive Real Estate Count in the Top 5 States")
Text(0.5, 1.0, 'Most Expensive Real Estate Count in the Top 5 States')
From the correlation matrix shown below, turns out number of bathrooms has the strongest positive linear relationship to the sale price compared to other variables. It may be unexpected that the size of the real estate is not the variable with the strongest linear relationship. However, its important to point out that there are other factors that influences real estate prices such as the environment, location choice and the ability of the property to demand higher rents which is influenced by the purpose of using the land.
# Selecting columns for correlation calculation
columns = ['Sale_Price_Thousands', 'bath', 'bed', 'acre_lot', 'house_size']
subset = top_5_states[columns]
display(subset.corr())
| Sale_Price_Thousands | bath | bed | acre_lot | house_size | |
|---|---|---|---|---|---|
| Sale_Price_Thousands | 1.000000 | 0.554682 | 0.296698 | 0.008957 | 0.431435 |
| bath | 0.554682 | 1.000000 | 0.618261 | 0.007481 | 0.600317 |
| bed | 0.296698 | 0.618261 | 1.000000 | -0.000537 | 0.465522 |
| acre_lot | 0.008957 | 0.007481 | -0.000537 | 1.000000 | 0.004948 |
| house_size | 0.431435 | 0.600317 | 0.465522 | 0.004948 | 1.000000 |
The the correlation between House Size and House Price is 0.4314. Which shows that House size does not have a strong positive relationship with the housing prices
correlation = final_real_estate[['house_size', 'Sale_Price_Thousands']].corr()
# Visualize
plt.figure(figsize=(8, 5))
sns.scatterplot(data=final_real_estate, x='house_size', y='Sale_Price_Thousands', alpha=0.4)
plt.title("House Size vs. Price")
plt.xlabel("House Size (sqft)")
plt.ylabel("Price ($)")
plt.show()
The number of bathrom has the strongest positive relationship with the housing price based on the correlation matrix (0.55), however it is important to note that there are other factors that would contribute to housing prices such as the population density of the state and other aspects of the property itself.
correlation = final_real_estate[['bath', 'Sale_Price_Thousands']].corr()
# Visualize
plt.figure(figsize=(8, 5))
sns.scatterplot(data=final_real_estate, x='bath', y='Sale_Price_Thousands', alpha=0.4)
plt.title("Number of Bathrooms vs. Price")
plt.xlabel("Number of Bathrooms")
plt.ylabel("Price ($)")
plt.show()
Based on the table below, a 3 bedroom and 2 bathroom property is the most popular real estate in terms of sales volume every year for the past 22 years. This maybe due to the fact that this is the average family size in the US which drives the most demand for this type of property structure. Moreover, the spike of property sales during 2021 may be because of the covid 19 pandemic which may have affected the financial wellbeing of alot of people, which led to the sale of property. Moreover, the drop in property sales on 2022 may be due to insufficient data collected for the year.
# Value_counts() would remove the column names, but by reindexing, we can set the column names
raw_top_house = final_real_estate.groupby('year')[['bed', 'bath']].value_counts().reset_index()
raw_top_house.columns = ['year', 'bed', ' bath', 'counts']
# Generating the index to identify the max counts for each year (idx is in boolean)
idx = raw_top_house.groupby(['year'])['counts'].transform(max) == raw_top_house['counts']
# Will only return rows that matches True
popular_property = raw_top_house[idx]
import plotly.express as px
# Create the line chart
fig = px.line(
popular_property,
x='year',
y='counts',
markers=True,
title= "Sales Trend for the Most demanded property over the years (3 bedroom and 2 bathrooms)",
labels={'year': 'Year', 'counts': 'Number of Listings'}
)
# Customize hover tooltip (optional)
fig.update_traces(hovertemplate='Year: %{x}<br>Listings: %{y}')
# Show the interactive chart
fig.show(renderer = "notebook")
From the results, property with unbalanced bathroom to bedroom ratio property or properties that are much larger in size (bedroom and bathroom numbers) tend to be less popular due to the average household size which makes such properties harder to sell.
# Value_counts() would remove the column names, but by reindexing, we can set the column names
raw_top_house = final_real_estate.groupby('year')[['bed', 'bath']].value_counts().reset_index()
raw_top_house.columns = ['year', 'bed', ' bath', 'counts']
# Generating the index to identify the max counts for each year (idx is in boolean)
idx_min = raw_top_house.groupby(['year'])['counts'].transform(min) == raw_top_house['counts']
# Will only return rows that matches True
pd.set_option('display.max_rows', 500)
display(raw_top_house[idx_min])
| year | bed | bath | counts | |
|---|---|---|---|---|
| 34 | 2000 | 11.0 | 12.0 | 1 |
| 35 | 2000 | 11.0 | 6.0 | 1 |
| 36 | 2000 | 11.0 | 11.0 | 1 |
| 37 | 2000 | 9.0 | 9.0 | 1 |
| 38 | 2000 | 7.0 | 6.0 | 1 |
| 39 | 2000 | 1.0 | 3.0 | 1 |
| 40 | 2000 | 3.0 | 5.0 | 1 |
| 41 | 2000 | 3.0 | 6.0 | 1 |
| 42 | 2000 | 4.0 | 6.0 | 1 |
| 43 | 2000 | 6.0 | 12.0 | 1 |
| 44 | 2000 | 7.0 | 3.0 | 1 |
| 45 | 2000 | 7.0 | 8.0 | 1 |
| 46 | 2000 | 9.0 | 7.0 | 1 |
| 47 | 2000 | 7.0 | 10.0 | 1 |
| 48 | 2000 | 8.0 | 2.0 | 1 |
| 49 | 2000 | 8.0 | 5.0 | 1 |
| 50 | 2000 | 8.0 | 6.0 | 1 |
| 51 | 2000 | 8.0 | 13.0 | 1 |
| 52 | 2000 | 9.0 | 3.0 | 1 |
| 88 | 2001 | 6.0 | 9.0 | 1 |
| 89 | 2001 | 1.0 | 3.0 | 1 |
| 90 | 2001 | 3.0 | 5.0 | 1 |
| 91 | 2001 | 4.0 | 8.0 | 1 |
| 92 | 2001 | 5.0 | 8.0 | 1 |
| 93 | 2001 | 5.0 | 10.0 | 1 |
| 94 | 2001 | 7.0 | 8.0 | 1 |
| 95 | 2001 | 7.0 | 9.0 | 1 |
| 96 | 2001 | 8.0 | 2.0 | 1 |
| 97 | 2001 | 8.0 | 8.0 | 1 |
| 98 | 2001 | 9.0 | 9.0 | 1 |
| 99 | 2001 | 10.0 | 12.0 | 1 |
| 100 | 2001 | 10.0 | 14.0 | 1 |
| 101 | 2001 | 12.0 | 9.0 | 1 |
| 102 | 2001 | 12.0 | 6.0 | 1 |
| 137 | 2002 | 2.0 | 4.0 | 1 |
| 138 | 2002 | 9.0 | 5.0 | 1 |
| 139 | 2002 | 10.0 | 5.0 | 1 |
| 140 | 2002 | 9.0 | 4.0 | 1 |
| 141 | 2002 | 3.0 | 8.0 | 1 |
| 142 | 2002 | 6.0 | 10.0 | 1 |
| 143 | 2002 | 4.0 | 7.0 | 1 |
| 144 | 2002 | 4.0 | 8.0 | 1 |
| 145 | 2002 | 5.0 | 1.0 | 1 |
| 146 | 2002 | 5.0 | 9.0 | 1 |
| 147 | 2002 | 6.0 | 8.0 | 1 |
| 148 | 2002 | 6.0 | 9.0 | 1 |
| 149 | 2002 | 6.0 | 11.0 | 1 |
| 150 | 2002 | 8.0 | 8.0 | 1 |
| 151 | 2002 | 6.0 | 12.0 | 1 |
| 152 | 2002 | 7.0 | 10.0 | 1 |
| 153 | 2002 | 7.0 | 11.0 | 1 |
| 154 | 2002 | 8.0 | 3.0 | 1 |
| 155 | 2002 | 8.0 | 5.0 | 1 |
| 156 | 2002 | 8.0 | 6.0 | 1 |
| 193 | 2003 | 7.0 | 11.0 | 1 |
| 194 | 2003 | 1.0 | 4.0 | 1 |
| 195 | 2003 | 3.0 | 5.0 | 1 |
| 196 | 2003 | 4.0 | 7.0 | 1 |
| 197 | 2003 | 5.0 | 9.0 | 1 |
| 198 | 2003 | 6.0 | 8.0 | 1 |
| 199 | 2003 | 7.0 | 7.0 | 1 |
| 200 | 2003 | 7.0 | 13.0 | 1 |
| 201 | 2003 | 8.0 | 1.0 | 1 |
| 202 | 2003 | 8.0 | 7.0 | 1 |
| 203 | 2003 | 9.0 | 6.0 | 1 |
| 204 | 2003 | 9.0 | 8.0 | 1 |
| 205 | 2003 | 9.0 | 10.0 | 1 |
| 252 | 2004 | 9.0 | 10.0 | 1 |
| 253 | 2004 | 10.0 | 3.0 | 1 |
| 254 | 2004 | 13.0 | 14.0 | 1 |
| 255 | 2004 | 15.0 | 9.0 | 1 |
| 256 | 2004 | 16.0 | 9.0 | 1 |
| 257 | 2004 | 8.0 | 7.0 | 1 |
| 258 | 2004 | 8.0 | 13.0 | 1 |
| 259 | 2004 | 7.0 | 12.0 | 1 |
| 260 | 2004 | 7.0 | 8.0 | 1 |
| 261 | 2004 | 4.0 | 7.0 | 1 |
| 307 | 2005 | 12.0 | 3.0 | 1 |
| 308 | 2005 | 15.0 | 6.0 | 1 |
| 309 | 2005 | 10.0 | 5.0 | 1 |
| 310 | 2005 | 11.0 | 9.0 | 1 |
| 311 | 2005 | 10.0 | 4.0 | 1 |
| 312 | 2005 | 5.0 | 10.0 | 1 |
| 313 | 2005 | 2.0 | 4.0 | 1 |
| 314 | 2005 | 7.0 | 7.0 | 1 |
| 315 | 2005 | 9.0 | 10.0 | 1 |
| 316 | 2005 | 7.0 | 9.0 | 1 |
| 317 | 2005 | 7.0 | 10.0 | 1 |
| 318 | 2005 | 7.0 | 11.0 | 1 |
| 319 | 2005 | 8.0 | 7.0 | 1 |
| 320 | 2005 | 9.0 | 7.0 | 1 |
| 321 | 2005 | 9.0 | 9.0 | 1 |
| 362 | 2006 | 5.0 | 9.0 | 1 |
| 363 | 2006 | 1.0 | 4.0 | 1 |
| 364 | 2006 | 4.0 | 6.0 | 1 |
| 365 | 2006 | 4.0 | 7.0 | 1 |
| 366 | 2006 | 4.0 | 9.0 | 1 |
| 367 | 2006 | 6.0 | 8.0 | 1 |
| 368 | 2006 | 6.0 | 9.0 | 1 |
| 369 | 2006 | 6.0 | 10.0 | 1 |
| 370 | 2006 | 7.0 | 7.0 | 1 |
| 371 | 2006 | 7.0 | 8.0 | 1 |
| 372 | 2006 | 7.0 | 9.0 | 1 |
| 373 | 2006 | 8.0 | 2.0 | 1 |
| 374 | 2006 | 8.0 | 7.0 | 1 |
| 375 | 2006 | 8.0 | 6.0 | 1 |
| 376 | 2006 | 8.0 | 9.0 | 1 |
| 377 | 2006 | 8.0 | 10.0 | 1 |
| 378 | 2006 | 9.0 | 5.0 | 1 |
| 379 | 2006 | 9.0 | 14.0 | 1 |
| 380 | 2006 | 9.0 | 17.0 | 1 |
| 421 | 2007 | 12.0 | 6.0 | 1 |
| 422 | 2007 | 12.0 | 13.0 | 1 |
| 423 | 2007 | 12.0 | 2.0 | 1 |
| 424 | 2007 | 7.0 | 8.0 | 1 |
| 425 | 2007 | 1.0 | 2.0 | 1 |
| 426 | 2007 | 1.0 | 3.0 | 1 |
| 427 | 2007 | 6.0 | 10.0 | 1 |
| 428 | 2007 | 7.0 | 10.0 | 1 |
| 429 | 2007 | 11.0 | 5.0 | 1 |
| 430 | 2007 | 8.0 | 5.0 | 1 |
| 431 | 2007 | 8.0 | 6.0 | 1 |
| 432 | 2007 | 8.0 | 11.0 | 1 |
| 433 | 2007 | 9.0 | 3.0 | 1 |
| 434 | 2007 | 9.0 | 9.0 | 1 |
| 435 | 2007 | 11.0 | 3.0 | 1 |
| 479 | 2008 | 8.0 | 13.0 | 1 |
| 480 | 2008 | 2.0 | 4.0 | 1 |
| 481 | 2008 | 5.0 | 1.0 | 1 |
| 482 | 2008 | 5.0 | 7.0 | 1 |
| 483 | 2008 | 6.0 | 1.0 | 1 |
| 484 | 2008 | 6.0 | 10.0 | 1 |
| 485 | 2008 | 8.0 | 8.0 | 1 |
| 486 | 2008 | 9.0 | 6.0 | 1 |
| 487 | 2008 | 9.0 | 9.0 | 1 |
| 488 | 2008 | 10.0 | 3.0 | 1 |
| 489 | 2008 | 10.0 | 4.0 | 1 |
| 490 | 2008 | 11.0 | 3.0 | 1 |
| 491 | 2008 | 11.0 | 6.0 | 1 |
| 492 | 2008 | 11.0 | 8.0 | 1 |
| 493 | 2008 | 13.0 | 7.0 | 1 |
| 494 | 2008 | 19.0 | 19.0 | 1 |
| 530 | 2009 | 3.0 | 7.0 | 1 |
| 531 | 2009 | 6.0 | 9.0 | 1 |
| 532 | 2009 | 14.0 | 15.0 | 1 |
| 533 | 2009 | 7.0 | 5.0 | 1 |
| 534 | 2009 | 7.0 | 9.0 | 1 |
| 535 | 2009 | 8.0 | 3.0 | 1 |
| 536 | 2009 | 8.0 | 10.0 | 1 |
| 537 | 2009 | 10.0 | 6.0 | 1 |
| 538 | 2009 | 12.0 | 6.0 | 1 |
| 573 | 2010 | 11.0 | 5.0 | 1 |
| 574 | 2010 | 12.0 | 6.0 | 1 |
| 575 | 2010 | 20.0 | 18.0 | 1 |
| 576 | 2010 | 8.0 | 5.0 | 1 |
| 577 | 2010 | 8.0 | 11.0 | 1 |
| 578 | 2010 | 8.0 | 2.0 | 1 |
| 579 | 2010 | 1.0 | 2.0 | 1 |
| 580 | 2010 | 7.0 | 6.0 | 1 |
| 581 | 2010 | 3.0 | 5.0 | 1 |
| 582 | 2010 | 5.0 | 1.0 | 1 |
| 583 | 2010 | 5.0 | 7.0 | 1 |
| 584 | 2010 | 5.0 | 8.0 | 1 |
| 585 | 2010 | 6.0 | 8.0 | 1 |
| 586 | 2010 | 7.0 | 2.0 | 1 |
| 620 | 2011 | 7.0 | 8.0 | 1 |
| 621 | 2011 | 5.0 | 8.0 | 1 |
| 622 | 2011 | 1.0 | 3.0 | 1 |
| 623 | 2011 | 4.0 | 6.0 | 1 |
| 624 | 2011 | 4.0 | 7.0 | 1 |
| 625 | 2011 | 6.0 | 10.0 | 1 |
| 626 | 2011 | 7.0 | 2.0 | 1 |
| 627 | 2011 | 7.0 | 6.0 | 1 |
| 628 | 2011 | 7.0 | 10.0 | 1 |
| 629 | 2011 | 7.0 | 12.0 | 1 |
| 630 | 2011 | 8.0 | 3.0 | 1 |
| 631 | 2011 | 8.0 | 5.0 | 1 |
| 632 | 2011 | 9.0 | 3.0 | 1 |
| 633 | 2011 | 9.0 | 6.0 | 1 |
| 634 | 2011 | 10.0 | 3.0 | 1 |
| 635 | 2011 | 11.0 | 7.0 | 1 |
| 636 | 2011 | 12.0 | 5.0 | 1 |
| 672 | 2012 | 22.0 | 15.0 | 1 |
| 673 | 2012 | 9.0 | 6.0 | 1 |
| 674 | 2012 | 9.0 | 14.0 | 1 |
| 675 | 2012 | 10.0 | 7.0 | 1 |
| 676 | 2012 | 12.0 | 3.0 | 1 |
| 677 | 2012 | 12.0 | 9.0 | 1 |
| 678 | 2012 | 13.0 | 3.0 | 1 |
| 679 | 2012 | 9.0 | 4.0 | 1 |
| 680 | 2012 | 9.0 | 3.0 | 1 |
| 681 | 2012 | 8.0 | 10.0 | 1 |
| 682 | 2012 | 8.0 | 7.0 | 1 |
| 683 | 2012 | 8.0 | 2.0 | 1 |
| 684 | 2012 | 8.0 | 1.0 | 1 |
| 685 | 2012 | 7.0 | 5.0 | 1 |
| 686 | 2012 | 6.0 | 12.0 | 1 |
| 687 | 2012 | 6.0 | 10.0 | 1 |
| 688 | 2012 | 6.0 | 9.0 | 1 |
| 689 | 2012 | 6.0 | 7.0 | 1 |
| 690 | 2012 | 5.0 | 9.0 | 1 |
| 691 | 2012 | 4.0 | 9.0 | 1 |
| 692 | 2012 | 4.0 | 8.0 | 1 |
| 693 | 2012 | 2.0 | 5.0 | 1 |
| 694 | 2012 | 2.0 | 4.0 | 1 |
| 730 | 2013 | 10.0 | 5.0 | 1 |
| 731 | 2013 | 8.0 | 2.0 | 1 |
| 732 | 2013 | 8.0 | 6.0 | 1 |
| 733 | 2013 | 8.0 | 8.0 | 1 |
| 734 | 2013 | 9.0 | 9.0 | 1 |
| 735 | 2013 | 10.0 | 3.0 | 1 |
| 736 | 2013 | 10.0 | 4.0 | 1 |
| 737 | 2013 | 10.0 | 7.0 | 1 |
| 738 | 2013 | 13.0 | 8.0 | 1 |
| 739 | 2013 | 7.0 | 10.0 | 1 |
| 740 | 2013 | 7.0 | 4.0 | 1 |
| 741 | 2013 | 6.0 | 12.0 | 1 |
| 742 | 2013 | 6.0 | 8.0 | 1 |
| 743 | 2013 | 2.0 | 4.0 | 1 |
| 744 | 2013 | 2.0 | 5.0 | 1 |
| 745 | 2013 | 3.0 | 5.0 | 1 |
| 746 | 2013 | 3.0 | 6.0 | 1 |
| 747 | 2013 | 4.0 | 10.0 | 1 |
| 748 | 2013 | 5.0 | 8.0 | 1 |
| 749 | 2013 | 5.0 | 9.0 | 1 |
| 788 | 2014 | 12.0 | 3.0 | 1 |
| 789 | 2014 | 12.0 | 8.0 | 1 |
| 790 | 2014 | 12.0 | 16.0 | 1 |
| 791 | 2014 | 8.0 | 10.0 | 1 |
| 792 | 2014 | 11.0 | 4.0 | 1 |
| 793 | 2014 | 8.0 | 6.0 | 1 |
| 794 | 2014 | 8.0 | 5.0 | 1 |
| 795 | 2014 | 2.0 | 4.0 | 1 |
| 796 | 2014 | 6.0 | 11.0 | 1 |
| 797 | 2014 | 6.0 | 12.0 | 1 |
| 798 | 2014 | 7.0 | 9.0 | 1 |
| 799 | 2014 | 7.0 | 14.0 | 1 |
| 838 | 2015 | 5.0 | 10.0 | 1 |
| 839 | 2015 | 5.0 | 1.0 | 1 |
| 840 | 2015 | 5.0 | 8.0 | 1 |
| 841 | 2015 | 5.0 | 9.0 | 1 |
| 842 | 2015 | 6.0 | 8.0 | 1 |
| 843 | 2015 | 7.0 | 8.0 | 1 |
| 844 | 2015 | 7.0 | 9.0 | 1 |
| 845 | 2015 | 7.0 | 12.0 | 1 |
| 846 | 2015 | 9.0 | 4.0 | 1 |
| 847 | 2015 | 10.0 | 4.0 | 1 |
| 848 | 2015 | 10.0 | 7.0 | 1 |
| 849 | 2015 | 22.0 | 19.0 | 1 |
| 850 | 2015 | 14.0 | 6.0 | 1 |
| 886 | 2016 | 2.0 | 4.0 | 1 |
| 887 | 2016 | 5.0 | 8.0 | 1 |
| 888 | 2016 | 10.0 | 5.0 | 1 |
| 889 | 2016 | 7.0 | 5.0 | 1 |
| 890 | 2016 | 8.0 | 2.0 | 1 |
| 891 | 2016 | 9.0 | 4.0 | 1 |
| 892 | 2016 | 9.0 | 5.0 | 1 |
| 893 | 2016 | 9.0 | 6.0 | 1 |
| 894 | 2016 | 10.0 | 2.0 | 1 |
| 895 | 2016 | 10.0 | 7.0 | 1 |
| 896 | 2016 | 10.0 | 10.0 | 1 |
| 897 | 2016 | 11.0 | 6.0 | 1 |
| 898 | 2016 | 12.0 | 12.0 | 1 |
| 940 | 2017 | 14.0 | 17.0 | 1 |
| 941 | 2017 | 14.0 | 4.0 | 1 |
| 942 | 2017 | 14.0 | 8.0 | 1 |
| 943 | 2017 | 13.0 | 12.0 | 1 |
| 944 | 2017 | 13.0 | 3.0 | 1 |
| 945 | 2017 | 9.0 | 4.0 | 1 |
| 946 | 2017 | 7.0 | 9.0 | 1 |
| 947 | 2017 | 7.0 | 10.0 | 1 |
| 948 | 2017 | 7.0 | 12.0 | 1 |
| 949 | 2017 | 8.0 | 2.0 | 1 |
| 950 | 2017 | 8.0 | 6.0 | 1 |
| 951 | 2017 | 9.0 | 2.0 | 1 |
| 952 | 2017 | 9.0 | 8.0 | 1 |
| 953 | 2017 | 6.0 | 1.0 | 1 |
| 954 | 2017 | 9.0 | 9.0 | 1 |
| 955 | 2017 | 10.0 | 10.0 | 1 |
| 956 | 2017 | 11.0 | 6.0 | 1 |
| 957 | 2017 | 12.0 | 3.0 | 1 |
| 958 | 2017 | 12.0 | 5.0 | 1 |
| 959 | 2017 | 12.0 | 6.0 | 1 |
| 960 | 2017 | 6.0 | 9.0 | 1 |
| 961 | 2017 | 5.0 | 9.0 | 1 |
| 962 | 2017 | 4.0 | 10.0 | 1 |
| 963 | 2017 | 2.0 | 4.0 | 1 |
| 964 | 2017 | 3.0 | 6.0 | 1 |
| 965 | 2017 | 4.0 | 7.0 | 1 |
| 1011 | 2018 | 5.0 | 1.0 | 1 |
| 1012 | 2018 | 14.0 | 17.0 | 1 |
| 1013 | 2018 | 9.0 | 8.0 | 1 |
| 1014 | 2018 | 10.0 | 2.0 | 1 |
| 1015 | 2018 | 10.0 | 5.0 | 1 |
| 1016 | 2018 | 11.0 | 4.0 | 1 |
| 1017 | 2018 | 11.0 | 8.0 | 1 |
| 1018 | 2018 | 12.0 | 4.0 | 1 |
| 1019 | 2018 | 14.0 | 7.0 | 1 |
| 1020 | 2018 | 8.0 | 8.0 | 1 |
| 1021 | 2018 | 5.0 | 9.0 | 1 |
| 1022 | 2018 | 6.0 | 10.0 | 1 |
| 1023 | 2018 | 7.0 | 6.0 | 1 |
| 1024 | 2018 | 8.0 | 2.0 | 1 |
| 1063 | 2019 | 14.0 | 8.0 | 1 |
| 1064 | 2019 | 13.0 | 8.0 | 1 |
| 1065 | 2019 | 12.0 | 16.0 | 1 |
| 1066 | 2019 | 12.0 | 5.0 | 1 |
| 1067 | 2019 | 12.0 | 4.0 | 1 |
| 1068 | 2019 | 11.0 | 4.0 | 1 |
| 1069 | 2019 | 11.0 | 3.0 | 1 |
| 1070 | 2019 | 10.0 | 4.0 | 1 |
| 1071 | 2019 | 9.0 | 7.0 | 1 |
| 1072 | 2019 | 9.0 | 5.0 | 1 |
| 1073 | 2019 | 8.0 | 11.0 | 1 |
| 1074 | 2019 | 7.0 | 9.0 | 1 |
| 1075 | 2019 | 7.0 | 6.0 | 1 |
| 1076 | 2019 | 7.0 | 5.0 | 1 |
| 1077 | 2019 | 7.0 | 2.0 | 1 |
| 1078 | 2019 | 6.0 | 11.0 | 1 |
| 1079 | 2019 | 6.0 | 1.0 | 1 |
| 1080 | 2019 | 2.0 | 4.0 | 1 |
| 1125 | 2020 | 4.0 | 7.0 | 1 |
| 1126 | 2020 | 7.0 | 7.0 | 1 |
| 1127 | 2020 | 7.0 | 8.0 | 1 |
| 1128 | 2020 | 19.0 | 16.0 | 1 |
| 1129 | 2020 | 16.0 | 6.0 | 1 |
| 1130 | 2020 | 12.0 | 9.0 | 1 |
| 1131 | 2020 | 12.0 | 4.0 | 1 |
| 1132 | 2020 | 12.0 | 2.0 | 1 |
| 1133 | 2020 | 11.0 | 4.0 | 1 |
| 1134 | 2020 | 10.0 | 5.0 | 1 |
| 1135 | 2020 | 10.0 | 3.0 | 1 |
| 1136 | 2020 | 9.0 | 6.0 | 1 |
| 1137 | 2020 | 8.0 | 11.0 | 1 |
| 1138 | 2020 | 8.0 | 2.0 | 1 |
| 1139 | 2020 | 7.0 | 10.0 | 1 |
| 1182 | 2021 | 10.0 | 8.0 | 1 |
| 1183 | 2021 | 9.0 | 5.0 | 1 |
| 1184 | 2021 | 10.0 | 6.0 | 1 |
| 1185 | 2021 | 10.0 | 3.0 | 1 |
| 1186 | 2021 | 10.0 | 2.0 | 1 |
| 1187 | 2021 | 11.0 | 3.0 | 1 |
| 1188 | 2021 | 13.0 | 5.0 | 1 |
| 1189 | 2021 | 14.0 | 5.0 | 1 |
| 1190 | 2021 | 9.0 | 2.0 | 1 |
| 1191 | 2021 | 8.0 | 8.0 | 1 |
| 1192 | 2021 | 8.0 | 6.0 | 1 |
| 1193 | 2021 | 8.0 | 4.0 | 1 |
| 1194 | 2021 | 8.0 | 2.0 | 1 |
| 1195 | 2021 | 7.0 | 9.0 | 1 |
| 1196 | 2021 | 7.0 | 2.0 | 1 |
| 1197 | 2021 | 7.0 | 1.0 | 1 |
| 1198 | 2021 | 3.0 | 6.0 | 1 |
| 1199 | 2021 | 3.0 | 5.0 | 1 |
| 1228 | 2022 | 6.0 | 5.0 | 1 |
| 1229 | 2022 | 6.0 | 9.0 | 1 |
| 1230 | 2022 | 18.0 | 9.0 | 1 |
| 1231 | 2022 | 7.0 | 8.0 | 1 |
| 1232 | 2022 | 9.0 | 10.0 | 1 |
| 1233 | 2022 | 8.0 | 4.0 | 1 |
| 1234 | 2022 | 7.0 | 3.0 | 1 |
| 1235 | 2022 | 7.0 | 2.0 | 1 |
| 1236 | 2022 | 8.0 | 6.0 | 1 |
| 1237 | 2022 | 2.0 | 4.0 | 1 |
| 1238 | 2022 | 8.0 | 10.0 | 1 |
| 1239 | 2022 | 5.0 | 1.0 | 1 |
| 1240 | 2022 | 9.0 | 13.0 | 1 |
| 1241 | 2022 | 10.0 | 6.0 | 1 |
| 1242 | 2022 | 1.0 | 2.0 | 1 |
| 1243 | 2022 | 6.0 | 10.0 | 1 |
| 1244 | 2022 | 10.0 | 7.0 | 1 |
| 1245 | 2022 | 11.0 | 5.0 | 1 |
| 1246 | 2022 | 17.0 | 12.0 | 1 |
| 1247 | 2022 | 9.0 | 6.0 | 1 |
| 1248 | 2022 | 47.0 | 39.0 | 1 |
has_duplicates = final_real_estate.duplicated().any()
# Splitting the dataset by popularity
#1 Count how many times each (bed, bath) combination appears
combo_counts = final_real_estate.groupby(['bed', 'bath']).size().reset_index(name = 'Sales Count (Bed and Bath)')
#2 Merge the counts to the original dataset
final_real_estate_counts = final_real_estate.merge(combo_counts, on = ['bed', 'bath'], how = 'left')
#3 use quantiles to define popularity levels
q_low = final_real_estate_counts['Sales Count (Bed and Bath)'].quantile(0.33)
q_high = final_real_estate_counts['Sales Count (Bed and Bath)'].quantile(0.66)
# Create a new column for popularity
def popularity_label(count):
if count <= q_low:
return 'Least Popular'
elif count <= q_high:
return 'Average Popularity'
else:
return 'Most Popular'
# apply the function to the dataset
final_real_estate_counts['popularity'] = final_real_estate_counts['Sales Count (Bed and Bath)'].apply(popularity_label)
From the below table, it is showned that majority of the property with room ratio of less than 1 or more than 1.5 are categorised as less popular. Generally speaking less popular property do not have a very good balance of bedroom and bathroom, or property with too much bedrooms and bathroom tend to less attractive as it exceed the average people per household in the US and it may be difficult for investors to resell it due to its size and demand for the property.
least_popular_property = final_real_estate_counts[final_real_estate_counts['popularity'] == 'Least Popular']
display(least_popular_property.groupby(['bed', 'bath', 'room_ratio']).size().reset_index())
| bed | bath | room_ratio | 0 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.000000 | 347 |
| 1 | 1.0 | 2.0 | 0.500000 | 78 |
| 2 | 1.0 | 3.0 | 0.333333 | 4 |
| 3 | 1.0 | 4.0 | 0.250000 | 2 |
| 4 | 2.0 | 1.0 | 2.000000 | 1366 |
| 5 | 2.0 | 3.0 | 0.666667 | 346 |
| 6 | 2.0 | 4.0 | 0.500000 | 25 |
| 7 | 2.0 | 5.0 | 0.400000 | 2 |
| 8 | 3.0 | 4.0 | 0.750000 | 456 |
| 9 | 3.0 | 5.0 | 0.600000 | 56 |
| 10 | 3.0 | 6.0 | 0.500000 | 10 |
| 11 | 3.0 | 7.0 | 0.428571 | 1 |
| 12 | 3.0 | 8.0 | 0.375000 | 1 |
| 13 | 4.0 | 1.0 | 4.000000 | 307 |
| 14 | 4.0 | 4.0 | 1.000000 | 1187 |
| 15 | 4.0 | 5.0 | 0.800000 | 329 |
| 16 | 4.0 | 6.0 | 0.666667 | 79 |
| 17 | 4.0 | 7.0 | 0.571429 | 25 |
| 18 | 4.0 | 8.0 | 0.500000 | 3 |
| 19 | 4.0 | 9.0 | 0.444444 | 2 |
| 20 | 4.0 | 10.0 | 0.400000 | 2 |
| 21 | 5.0 | 1.0 | 5.000000 | 24 |
| 22 | 5.0 | 2.0 | 2.500000 | 432 |
| 23 | 5.0 | 3.0 | 1.666667 | 695 |
| 24 | 5.0 | 4.0 | 1.250000 | 576 |
| 25 | 5.0 | 5.0 | 1.000000 | 379 |
| 26 | 5.0 | 6.0 | 0.833333 | 202 |
| 27 | 5.0 | 7.0 | 0.714286 | 90 |
| 28 | 5.0 | 8.0 | 0.625000 | 25 |
| 29 | 5.0 | 9.0 | 0.555556 | 15 |
| 30 | 5.0 | 10.0 | 0.500000 | 3 |
| 31 | 6.0 | 1.0 | 6.000000 | 3 |
| 32 | 6.0 | 2.0 | 3.000000 | 227 |
| 33 | 6.0 | 3.0 | 2.000000 | 285 |
| 34 | 6.0 | 4.0 | 1.500000 | 198 |
| 35 | 6.0 | 5.0 | 1.200000 | 141 |
| 36 | 6.0 | 6.0 | 1.000000 | 104 |
| 37 | 6.0 | 7.0 | 0.857143 | 91 |
| 38 | 6.0 | 8.0 | 0.750000 | 65 |
| 39 | 6.0 | 9.0 | 0.666667 | 37 |
| 40 | 6.0 | 10.0 | 0.600000 | 19 |
| 41 | 6.0 | 11.0 | 0.545455 | 8 |
| 42 | 6.0 | 12.0 | 0.500000 | 5 |
| 43 | 7.0 | 1.0 | 7.000000 | 1 |
| 44 | 7.0 | 2.0 | 3.500000 | 27 |
| 45 | 7.0 | 3.0 | 2.333333 | 84 |
| 46 | 7.0 | 4.0 | 1.750000 | 57 |
| 47 | 7.0 | 5.0 | 1.400000 | 47 |
| 48 | 7.0 | 6.0 | 1.166667 | 23 |
| 49 | 7.0 | 7.0 | 1.000000 | 26 |
| 50 | 7.0 | 8.0 | 0.875000 | 24 |
| 51 | 7.0 | 9.0 | 0.777778 | 20 |
| 52 | 7.0 | 10.0 | 0.700000 | 10 |
| 53 | 7.0 | 11.0 | 0.636364 | 5 |
| 54 | 7.0 | 12.0 | 0.583333 | 4 |
| 55 | 7.0 | 13.0 | 0.538462 | 1 |
| 56 | 7.0 | 14.0 | 0.500000 | 1 |
| 57 | 8.0 | 1.0 | 8.000000 | 2 |
| 58 | 8.0 | 2.0 | 4.000000 | 15 |
| 59 | 8.0 | 3.0 | 2.666667 | 66 |
| 60 | 8.0 | 4.0 | 2.000000 | 56 |
| 61 | 8.0 | 5.0 | 1.600000 | 26 |
| 62 | 8.0 | 6.0 | 1.333333 | 19 |
| 63 | 8.0 | 7.0 | 1.142857 | 5 |
| 64 | 8.0 | 8.0 | 1.000000 | 6 |
| 65 | 8.0 | 9.0 | 0.888889 | 4 |
| 66 | 8.0 | 10.0 | 0.800000 | 5 |
| 67 | 8.0 | 11.0 | 0.727273 | 4 |
| 68 | 8.0 | 13.0 | 0.615385 | 3 |
| 69 | 9.0 | 2.0 | 4.500000 | 2 |
| 70 | 9.0 | 3.0 | 3.000000 | 65 |
| 71 | 9.0 | 4.0 | 2.250000 | 24 |
| 72 | 9.0 | 5.0 | 1.800000 | 13 |
| 73 | 9.0 | 6.0 | 1.500000 | 15 |
| 74 | 9.0 | 7.0 | 1.285714 | 3 |
| 75 | 9.0 | 8.0 | 1.125000 | 3 |
| 76 | 9.0 | 9.0 | 1.000000 | 7 |
| 77 | 9.0 | 10.0 | 0.900000 | 6 |
| 78 | 9.0 | 13.0 | 0.692308 | 1 |
| 79 | 9.0 | 14.0 | 0.642857 | 2 |
| 80 | 9.0 | 17.0 | 0.529412 | 1 |
| 81 | 10.0 | 2.0 | 5.000000 | 3 |
| 82 | 10.0 | 3.0 | 3.333333 | 8 |
| 83 | 10.0 | 4.0 | 2.500000 | 16 |
| 84 | 10.0 | 5.0 | 2.000000 | 8 |
| 85 | 10.0 | 6.0 | 1.666667 | 3 |
| 86 | 10.0 | 7.0 | 1.428571 | 5 |
| 87 | 10.0 | 8.0 | 1.250000 | 3 |
| 88 | 10.0 | 10.0 | 1.000000 | 2 |
| 89 | 10.0 | 12.0 | 0.833333 | 1 |
| 90 | 10.0 | 14.0 | 0.714286 | 1 |
| 91 | 11.0 | 3.0 | 3.666667 | 4 |
| 92 | 11.0 | 4.0 | 2.750000 | 4 |
| 93 | 11.0 | 5.0 | 2.200000 | 3 |
| 94 | 11.0 | 6.0 | 1.833333 | 4 |
| 95 | 11.0 | 7.0 | 1.571429 | 1 |
| 96 | 11.0 | 8.0 | 1.375000 | 2 |
| 97 | 11.0 | 9.0 | 1.222222 | 1 |
| 98 | 11.0 | 11.0 | 1.000000 | 1 |
| 99 | 11.0 | 12.0 | 0.916667 | 1 |
| 100 | 12.0 | 2.0 | 6.000000 | 2 |
| 101 | 12.0 | 3.0 | 4.000000 | 6 |
| 102 | 12.0 | 4.0 | 3.000000 | 3 |
| 103 | 12.0 | 5.0 | 2.400000 | 3 |
| 104 | 12.0 | 6.0 | 2.000000 | 9 |
| 105 | 12.0 | 7.0 | 1.714286 | 2 |
| 106 | 12.0 | 8.0 | 1.500000 | 1 |
| 107 | 12.0 | 9.0 | 1.333333 | 3 |
| 108 | 12.0 | 12.0 | 1.000000 | 1 |
| 109 | 12.0 | 13.0 | 0.923077 | 1 |
| 110 | 12.0 | 16.0 | 0.750000 | 2 |
| 111 | 13.0 | 3.0 | 4.333333 | 2 |
| 112 | 13.0 | 5.0 | 2.600000 | 1 |
| 113 | 13.0 | 7.0 | 1.857143 | 1 |
| 114 | 13.0 | 8.0 | 1.625000 | 2 |
| 115 | 13.0 | 12.0 | 1.083333 | 1 |
| 116 | 13.0 | 14.0 | 0.928571 | 1 |
| 117 | 14.0 | 4.0 | 3.500000 | 1 |
| 118 | 14.0 | 5.0 | 2.800000 | 1 |
| 119 | 14.0 | 6.0 | 2.333333 | 1 |
| 120 | 14.0 | 7.0 | 2.000000 | 1 |
| 121 | 14.0 | 8.0 | 1.750000 | 2 |
| 122 | 14.0 | 15.0 | 0.933333 | 1 |
| 123 | 14.0 | 17.0 | 0.823529 | 2 |
| 124 | 15.0 | 6.0 | 2.500000 | 1 |
| 125 | 15.0 | 9.0 | 1.666667 | 1 |
| 126 | 16.0 | 6.0 | 2.666667 | 1 |
| 127 | 16.0 | 9.0 | 1.777778 | 1 |
| 128 | 17.0 | 12.0 | 1.416667 | 1 |
| 129 | 18.0 | 9.0 | 2.000000 | 1 |
| 130 | 19.0 | 16.0 | 1.187500 | 1 |
| 131 | 19.0 | 19.0 | 1.000000 | 1 |
| 132 | 20.0 | 18.0 | 1.111111 | 1 |
| 133 | 22.0 | 15.0 | 1.466667 | 1 |
| 134 | 22.0 | 19.0 | 1.157895 | 1 |
| 135 | 47.0 | 39.0 | 1.205128 | 1 |
From the analysis, a 3 bedroom and 2-3 bathroom property are the most popular because the demand for a property is heavily driven by the usage of the property, on average a US household consist of 3.15 people, thus property with this size is the most popular.
most_popular_property = final_real_estate_counts[final_real_estate_counts['popularity'] == 'Most Popular']
# identify the unique room_ratio
most_popular_property.groupby(['bed', 'bath','room_ratio']).size().reset_index()
| bed | bath | room_ratio | 0 | |
|---|---|---|---|---|
| 0 | 3.0 | 2.0 | 1.5 | 5255 |
| 1 | 3.0 | 3.0 | 1.0 | 2846 |
The top 5 States for real estate sales in the US are New Jersey, Connecticut, New York, Pennsylvania and Massachusetts for the past 22 years. If this trend continues, the real estates in these states may be potential investment options as the market is quite active, thus more liquid. However, real estates in these states may be potentially more expensive, but at the same time these states may have more employment opportunities. If quality of life such as pollution rates are of concern, other less crowded states maybe better choices for living.
Moreover, real estate values are mainly driven by value generating drivers such as location and the property's ability to generate rent rather than just real estate size. It is important to note that the balance between bathroom and bedroom is very important indicating the liquidity of a property, demands for property is mainly driven by needs of the buyer. If the buying motive is for investment purposes (potential future sale or rent generation), it is important to look into what the target market is demanding.
Furthermore, a 3 bedroom and 2 bathroom real estate is the most demanded real estate for the past 22 years and if this trend continues to the future, this is mainly driven by the average household size in the US. This type of real estate would be a safe and great investment option.